THE MATHEMATICAL REPRESENTATION OF 
FREQUENCY DISTRIBUTIONS 

By Harry C. Carver, University of Michigan 



Section I. distributions of graduated variates* 
Section II. distributions of integral variates* 
Section III. difference equation graduation 
Section IV. application of the hypergeometric series 

(A) $u_x = {}_{pn}C_{r-x} \cdot {}_{qn}C_x$
(B) $u_x = {}_{pn}H_{r-x} \cdot {}_{qn}H_x$



Section III 

DIFFERENCE EQUATION GRADUATION 

Certain geometrical properties of unimodal frequency distributions 
suggest that any associated frequency function may be represented 
as a solution of the difference equation 

$$\frac{\Delta y_x}{\Delta x} = \frac{y_x\,(a - x)}{f(x)} \tag{1}$$

since 

(a) if there be one mode only there must be a value of x = a for which 
$\Delta y_x = 0$, and 

(b) towards the extremes, the finite difference between two successive 
ordinates must approach zero as $y_x$ diminishes in value. 

The balance of the difference equation of the unknown theoretical 
law of distribution may be represented by a function, f(x), appearing 
in the denominator. 

We shall now assume that f(x) may be expanded in a power series 
which in practice is found to be rapidly convergent. The merits of 
this important assumption will be discussed briefly later. 

Expressing (1) then as 

$$(b_0 + b_1 x + b_2 x^2 + \dots)\,\Delta y_x = (a - x)\, y_x \cdot \Delta x,$$

multiplying through by $x^n$ and summing with respect to x yields 

$$b_0 \sum x^n \Delta y_x + b_1 \sum x^{n+1} \Delta y_x + b_2 \sum x^{n+2} \Delta y_x + \dots = \Bigl(a \sum x^n y_x - \sum x^{n+1} y_x\Bigr)\,\Delta x. \tag{2}$$

* Sections I and II of this paper appeared in the June issue of the Quarterly Publications. 
If the range of the distribution be from $x = -\infty$ to $x = \infty$, we have 
by finite integration by parts 

$$\sum x^n \Delta y_x = \Bigl[x^n y_x\Bigr]_{x=-\infty}^{x=\infty} - \sum \{(x+\Delta x)^n - x^n\}\, y_{x+\Delta x}$$

$$= -\sum \{x^n - (x-\Delta x)^n\}\, y_x$$

$$= -{}_nC_1 \sum x^{n-1} y_x\, \Delta x + {}_nC_2 \sum x^{n-2} y_x\, (\Delta x)^2 - \dots \tag{3}$$
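The identity (3) may be checked numerically for the unit interval. The following is a minimal sketch, assuming ordinates that vanish outside a finite range so that the bracketed boundary term drops out; the function names and the sample ordinates are hypothetical and serve only as illustration.

```python
from math import comb

def delta_sum(y, n):
    """Left side of (3): sum of x**n * (y[x+1] - y[x]) over every integer x.

    y maps integer positions to ordinates and is taken as zero elsewhere,
    so the boundary term of the summation by parts vanishes.
    """
    xs = range(min(y) - 1, max(y) + 1)
    return sum(x**n * (y.get(x + 1, 0) - y.get(x, 0)) for x in xs)

def expanded_sum(y, n):
    """Right side of (3) with unit interval: -nC1*S(n-1) + nC2*S(n-2) - ..."""
    total = 0
    for k in range(1, n + 1):
        power_sum = sum(x**(n - k) * yx for x, yx in y.items())
        total += (-1)**k * comb(n, k) * power_sum
    return total

# Hypothetical ordinates, chosen only for illustration.
ordinates = {-2: 1, -1: 4, 0: 6, 1: 3, 2: 1}
for n in range(6):
    print(n, delta_sum(ordinates, n), expanded_sum(ordinates, n))
```

The two columns printed agree for every n tried, which is all that the sketch is intended to show.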

When dealing with distributions of graduated variates, $\Delta x$ should 
be permitted to approach zero as a limit. 
Thus, equations (1), (2), and (3) become 

$$\frac{1}{y}\frac{dy}{dx} = \frac{a - x}{b_0 + b_1 x + b_2 x^2 + \dots} \tag{1a}$$

$$b_0 \int x^n\, dy + b_1 \int x^{n+1}\, dy + b_2 \int x^{n+2}\, dy + \dots = a \int x^n y\, dx - \int x^{n+1} y\, dx \tag{2a}$$

$$\int x^n\, dy = -N \cdot n\, \nu'_{n-1}. \tag{3a}$$

If we choose the mean of the distribution as origin and give n in 
(2a) successively the values 0, 1, 2, . . . we obtain 

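$$\begin{aligned}
a + b_1 + \dots &= 0 \\
b_0 + 3\nu_2 b_2 + \dots &= \nu_2 \\
\nu_2 a + 3\nu_2 b_1 + 4\nu_3 b_2 + \dots &= \nu_3 \\
\nu_3 a + 3\nu_2 b_0 + 4\nu_3 b_1 + 5\nu_4 b_2 + \dots &= \nu_4
\end{aligned} \tag{4a}$$

etc. 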

For distributions of graduated variates we take the common difference 
between any two successive class magnitudes as the unit for x 
(thus $\Delta x = 1$), and giving n successively the values 0, 1, 2, . . . as 
before we obtain from (2) and (3) 

$$\begin{aligned}
a + b_1 - b_2 + \dots &= 0 \\
b_0 - b_1 + (3\nu_2 + 1)\, b_2 + \dots &= \nu_2 \\
\nu_2 a - b_0 + (3\nu_2 + 1)\, b_1 + (4\nu_3 - 6\nu_2 - 1)\, b_2 + \dots &= \nu_3 \\
\nu_3 a + (3\nu_2 + 1)\, b_0 + (4\nu_3 - 6\nu_2 - 1)\, b_1 + (5\nu_4 - 10\nu_3 + 10\nu_2 + 1)\, b_2 + \dots &= \nu_4
\end{aligned} \tag{4}$$

etc. 

It should be noted that the moments in equations (4a), the differential case, are defined by 

$$\nu_n = \int x^n y\, dx$$

whereas those in equations (4), the difference case, are given by 

$$\nu_n = \sum x^n y_x.$$
A simultaneous solution of equations (4a) determines the constants 
of the differential equation (1a), which on integration produces Pearson's 
system of Generalized Probability Curves. 

Equations (4) likewise determine the constants for the corresponding 
difference equation (1), when $\Delta x = 1$. 

If for a particular distribution the series $b_0 + b_1 x + b_2 x^2 + \dots$ converges 
rapidly and it is possible to neglect all terms containing powers 
of x greater than the second, solutions of (4a) and (4) yield, letting 

$$\beta_1 = \frac{\nu_3^2}{\nu_2^3} \quad\text{and}\quad \beta_2 = \frac{\nu_4}{\nu_2^2},$$





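TABLE I 

$$\begin{array}{c|c|c}
\text{Differential values} & \text{Const.} & \text{Difference values} \\ \hline
-\dfrac{\nu_3}{\nu_2}\cdot\dfrac{\beta_2+3}{2(5\beta_2-6\beta_1-9)} & a & -\dfrac{\nu_3}{\nu_2}\cdot\dfrac{\beta_2+3-\frac{1}{\nu_2}}{2\left(5\beta_2-6\beta_1-9+\frac{1}{\nu_2}\right)}-\dfrac{1}{2} \\[2ex]
\dfrac{\nu_2\,(4\beta_2-3\beta_1)}{2(5\beta_2-6\beta_1-9)} & b_0 & \dfrac{\nu_2\left(4\beta_2-3\beta_1-\frac{1}{\nu_2}\right)}{2\left(5\beta_2-6\beta_1-9+\frac{1}{\nu_2}\right)}-a \\[2ex]
-a & b_1 & b_2-a \\[2ex]
\dfrac{2\beta_2-3\beta_1-6}{2(5\beta_2-6\beta_1-9)} & b_2 & \dfrac{2\beta_2-3\beta_1-6+\frac{1}{\nu_2}}{2\left(5\beta_2-6\beta_1-9+\frac{1}{\nu_2}\right)}
\end{array}$$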
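The difference values of Table I may be compared with a direct solution of the first four equations of (4). The sketch below is illustrative only: the function names are hypothetical, the closed forms are those obtained by solving (4) with all terms beyond $b_2$ dropped, and the point binomial with r = 6, p = 2/3 is an arbitrary test case in which $b_2$ should vanish.

```python
from fractions import Fraction as F
from math import comb

def central_moments(probs):
    """Return (nu_2, nu_3, nu_4) about the mean for probs = {position: probability}."""
    mean = sum(x * p for x, p in probs.items())
    return tuple(sum((x - mean)**k * p for x, p in probs.items()) for k in (2, 3, 4))

def solve_system_4(nu2, nu3, nu4):
    """Solve the first four equations of (4), cut off after b_2, for (a, b0, b1, b2)."""
    m = [
        [F(1), F(0), F(1), F(-1), F(0)],
        [F(0), F(1), F(-1), 3*nu2 + 1, nu2],
        [nu2, F(-1), 3*nu2 + 1, 4*nu3 - 6*nu2 - 1, nu3],
        [nu3, 3*nu2 + 1, 4*nu3 - 6*nu2 - 1, 5*nu4 - 10*nu3 + 10*nu2 + 1, nu4],
    ]
    for col in range(4):                      # Gauss-Jordan elimination in exact arithmetic
        pivot = next(i for i in range(col, 4) if m[i][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for i in range(4):
            if i != col:
                factor = m[i][col]
                m[i] = [vi - factor * vc for vi, vc in zip(m[i], m[col])]
    return tuple(row[4] for row in m)

def table_1_difference(nu2, nu3, nu4):
    """Closed-form difference values of a, b0, b1, b2 in terms of beta_1, beta_2, nu_2."""
    beta1 = nu3**2 / nu2**3
    beta2 = nu4 / nu2**2
    d = 2 * (5*beta2 - 6*beta1 - 9 + 1/nu2)
    b2 = (2*beta2 - 3*beta1 - 6 + 1/nu2) / d
    a = -(nu3 / nu2) * (beta2 + 3 - 1/nu2) / d - F(1, 2)
    b0 = nu2 * (4*beta2 - 3*beta1 - 1/nu2) / d - a
    b1 = b2 - a
    return a, b0, b1, b2

# Point binomial N(p+q)^r with r = 6, p = 2/3 (an arbitrary illustration).
r, p = 6, F(2, 3)
q = 1 - p
binomial = {x: comb(r, x) * p**(r - x) * q**x for x in range(r + 1)}
moments = central_moments(binomial)
print(solve_system_4(*moments))
print(table_1_difference(*moments))
```

Exact rational arithmetic is used so that agreement between the two routes is a matter of equality rather than of rounding.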



If the series, f(x), converges so rapidly that the term $b_2 x^2$ may also 
be neglected, that is in cases where the value of $b_2$ is not appreciably 
greater than its probable error, we have 





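TABLE II 

$$\begin{array}{c|c|c}
\text{Differential values} & \text{Const.} & \text{Difference values} \\ \hline
-\dfrac{\nu_3}{2\nu_2} & a & -\dfrac{\nu_3}{2\nu_2}-\dfrac{1}{2} \\[2ex]
\nu_2 & b_0 & \nu_2 - a \\[1ex]
-a & b_1 & -a
\end{array}$$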
In application, the difference equation has certain advantages over 
the differential equation. Thus, a knowledge of the values of the constants 
of the differential equation permits us to compute the theoretical 
ordinates only after the integration of the differential equation, the 
constant of integration being determined by imposing the condition 
that the sums of the graduated and ungraduated frequencies must be 
equal. 

The difference equation, however, requires no integration, since a 
knowledge of the constants permits us to compute first all necessary 
values of $\Delta y_x / y_x$ and then $y_{x+1} / y_x$, from which ordinates proportional to those 
required may be computed by successive multiplication. The condition 
that the sum of the graduated frequencies must equal that of the 
ungraduated determines the proper proportional factor. 
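The procedure of successive multiplication may be sketched as follows. The function name `graduate` and its arguments are hypothetical; x is measured from the mean in units of the class interval, and it is assumed that the denominator $b_0 + b_1 x + b_2 x^2$ does not vanish at any position used.

```python
from fractions import Fraction as F

def graduate(a, b0, b1, b2, first_x, count, total):
    """Graduated ordinates at first_x, first_x + 1, ... (count of them).

    Each ratio y_{x+1} / y_x is 1 + (a - x) / (b0 + b1*x + b2*x^2); the
    ordinates are then scaled so that their sum equals `total`.
    """
    ordinates = [F(1)]                  # arbitrary starting ordinate
    x = F(first_x)
    for _ in range(count - 1):
        ratio = 1 + (a - x) / (b0 + b1 * x + b2 * x * x)
        ordinates.append(ordinates[-1] * ratio)
        x += 1
    scale = F(total) / sum(ordinates)   # proportional factor from the total frequency
    return [y * scale for y in ordinates]

# Illustration: constants of a point binomial with r = 4, p = q = 1/2,
# namely a = -p, b0 = p(1 + rq), b1 = p, b2 = 0, with the first term at x = -rq.
print(graduate(F(-1, 2), F(3, 2), F(1, 2), F(0), first_x=-2, count=5, total=16))
```

With these constants the ordinates returned are the binomial coefficients 1, 4, 6, 4, 1, as they should be for a total frequency of 16.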

For numerical illustrations of the use of the difference equation 
method for graduating complete distributions as well as "stumps" of 
distributions, reference may be made to "On the Graduation of Frequency 
Distributions" by the writer.* 

Section IV 

APPLICATION OF THE HYPERGEOMETRIC SERIES 



(A) $u_x = {}_{pn}C_{r-x} \cdot {}_{qn}C_x$

(B) $u_x = {}_{pn}H_{r-x} \cdot {}_{qn}H_x$



Critical investigations of the variations which are found to exist in 
apparently homogeneous statistical data have led to the development 
of various theories of frequency distribution. 

One of the first of these, known to biologists as Quetelet's Law, 
states that the distribution of individuals ranked according to some 
common character in a frequency series may be represented by the 
successive terms of the expansion of the point binomial 

$$N(p+q)^r,$$

that is to say 

$$N\{p^r + {}_rC_1\, p^{r-1} q + {}_rC_2\, p^{r-2} q^2 + \dots + q^r\} \tag{1}$$

where p + q = 1, 

N = the total frequency of the distribution, 
r = an integer, representing, therefore, one less than the 
number of classes in the theoretical distribution. 

* Published in the Proceedings of the Casualty Actuarial and Statistical Society of America, vol. vi, 
part 1, No. 13. 
A student of probabilities, however, is more apt to associate the name 
of Bernoulli with this series, since the successive terms are merely the 
frequency expectations of $r, r-1, r-2, \dots$ occurrences in N trials 
of r independent events each, where the probability of the happening 
of each event is designated by p and of its non-happening by q. 

The difference equation of Quetelet's Law, taking the position of 
the first term as origin, is 

$$\frac{y_{x+1}}{y_x} = \frac{N \cdot {}_rC_{x+1}\, p^{r-x-1} q^{x+1}}{N \cdot {}_rC_x\, p^{r-x} q^x} = \frac{(r-x)\,q}{(x+1)\,p}$$

or 

$$\frac{\Delta y_x}{y_x} = \frac{rq - p - x}{(x+1)\,p}.$$



If the origin be now shifted to the mean, rq units distant, the new 
difference equation becomes 



$$\frac{\Delta y_x}{y_x} = \frac{-p - x}{p(1 + rq) + px} \tag{2}$$

which is of the form 

$$\frac{\Delta y_x}{y_x} = \frac{a - x}{b_0 + b_1 x}.$$



From the values of the constants of the difference equations given 
in Table II we have 

$$p = \frac{\nu_2 + \nu_3}{2\nu_2}, \qquad q = \frac{\nu_2 - \nu_3}{2\nu_2}, \qquad r = \frac{4\nu_2^3}{\nu_2^2 - \nu_3^2}. \tag{3}$$

From the above we see that if p, q, and r are to have any real significance, 
the absolute value of $\nu_3$ must be less than $\nu_2$. Otherwise r 
would be negative and either p or q would be greater than unity. 
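Formulae (3), together with the restriction just stated, may be put in computational form. This is a sketch only; the function name is hypothetical, and the moments in the illustration are those of a point binomial with r = 10, p = 3/4, for which $\nu_2 = rpq$ and $\nu_3 = rpq(p-q)$.

```python
from fractions import Fraction as F

def point_binomial_from_moments(nu2, nu3):
    """Return (p, q, r) from the second and third moments about the mean, by (3)."""
    nu2, nu3 = F(nu2), F(nu3)
    if nu2 <= 0 or abs(nu3) >= nu2:
        # Outside this range r would be negative and p or q would exceed unity.
        raise ValueError("the absolute value of nu_3 must be less than nu_2")
    p = (nu2 + nu3) / (2 * nu2)
    q = (nu2 - nu3) / (2 * nu2)
    r = 4 * nu2**3 / (nu2**2 - nu3**2)
    return p, q, r

# Moments of the point binomial with r = 10, p = 3/4, q = 1/4.
print(point_binomial_from_moments(F(15, 8), F(15, 16)))
```

For observed moments the value of r so found will not in general be an integer, and some rounding must then be decided upon by the computer.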

An important limit of Quetelet's Law is obtained by permitting q 
to approach zero and r infinity in such a manner that the product rq, 
representing the distance from the origin of the series to the mean, 
remains a constant and equal to m. 

This limit, 



$$N e^{-m}\Bigl\{1 + m + \frac{m^2}{2!} + \dots + \frac{m^x}{x!} + \dots\Bigr\} \tag{4}$$

is known as Poisson's Exponential Binomial Limit, and is often referred 
to as the Law of Small Numbers. 

The criterion for this series is obviously $\nu_2 = \nu_3$. 
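The approach of the point binomial to this limit may be observed numerically. The sketch below, with hypothetical function names and N taken as unity, holds the product rq fixed at m = 2 and compares the leading terms of the two series as r is increased.

```python
from math import comb, exp, factorial

def binomial_terms(r, q, count):
    """First `count` terms of (p + q)^r, the term in q^x standing at position x."""
    p = 1.0 - q
    return [comb(r, x) * p**(r - x) * q**x for x in range(count)]

def poisson_terms(m, count):
    """First `count` terms of e^(-m) * {1 + m + m^2/2! + ...}."""
    return [exp(-m) * m**x / factorial(x) for x in range(count)]

def largest_gap(m, r, count=12):
    """Largest absolute difference between corresponding terms when rq = m."""
    pairs = zip(binomial_terms(r, m / r, count), poisson_terms(m, count))
    return max(abs(u - v) for u, v in pairs)

for r in (10, 1000, 100000):
    print(r, largest_gap(2.0, r))
```

The gap printed diminishes steadily as r grows, in agreement with the limit stated.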
In the Philosophical Transactions of the Royal Society of London (vol. 
186, part 1, p. 360), Pearson presents a series which is more 
general and powerful than the point binomial mentioned above. 

It may be developed as follows: 

If from a bag containing pn black and qn white balls, r balls are withdrawn 
without replacements, the chances that the r balls withdrawn 
will contain $r, r-1, r-2, \dots, 2, 1, 0$ black balls are given by the 
successive terms of the hypergeometric series 

$$\frac{{}_{pn}C_{r-x} \cdot {}_{qn}C_x}{{}_nC_r}. \tag{5}$$



The difference equation of this series, referred to the mean which 
is rq units distant from the first term, is 



$$\frac{\Delta y_x}{y_x} = \frac{\{r(2p-1) - pn - 1\} - (n+2)\,x}{\{r(1-p) + 1 + x\}\{p(n-r) + 1 + x\}}. \tag{6}$$



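Comparing equation (6) with (4) of Section III, we obtain the following 
for the hypergeometric series $u_x = {}_{pn}C_{r-x} \cdot {}_{qn}C_x$: 

$$\begin{aligned}
\nu_2 &= rpq\,\frac{n-r}{n-1} \\
\nu_3 &= \nu_2\,(p-q)\,\frac{n-2r}{n-2} \\
\nu_4 &= \frac{\nu_2}{(n-2)(n-3)}\Bigl\{3(n-1)\Bigl(n+6-\frac{2}{pq}\Bigr)\nu_2 + n(n+1) - 6pq\,n^2\Bigr\}
\end{aligned} \tag{7}$$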
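Formulae (7) may be tested against moments computed directly from the terms of the series. In the sketch below the function names and the bag of 7 black and 5 white balls are hypothetical, and the bracket in $\nu_4$ is taken to read $n + 6 - 2/(pq)$.

```python
from fractions import Fraction as F
from math import comb

def series_c_moments(black, white, r):
    """Moments nu_2, nu_3, nu_4 about the mean of x, the number of white balls drawn."""
    weights = {x: comb(black, r - x) * comb(white, x) for x in range(r + 1)}
    total = sum(weights.values())            # equals comb(black + white, r)
    mean = F(sum(x * w for x, w in weights.items()), total)
    return tuple(
        sum((x - mean)**k * F(w, total) for x, w in weights.items())
        for k in (2, 3, 4)
    )

def formula_7(black, white, r):
    """The same three moments from the closed forms (7)."""
    n = black + white
    p, q = F(black, n), F(white, n)
    nu2 = r * p * q * F(n - r, n - 1)
    nu3 = nu2 * (p - q) * F(n - 2 * r, n - 2)
    bracket = 3 * (n - 1) * (n + 6 - 2 / (p * q)) * nu2 + n * (n + 1) - 6 * p * q * n**2
    nu4 = nu2 / ((n - 2) * (n - 3)) * bracket
    return nu2, nu3, nu4

print(series_c_moments(7, 5, 4))
print(formula_7(7, 5, 4))
```

Since the arithmetic is exact, any disagreement between the two lines printed would point to an error in the closed forms and not to rounding.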



But here again we find that for distributions of integral variates our 
results are unintelligible unless $\nu_3$ is in absolute value less than $\nu_2$. 

Again, since for this series $b_2 = \dfrac{1}{n+2}$ we have from Table I, 

$$n = \frac{6(\beta_2 - \beta_1 - 1)}{2\beta_2 - 3\beta_1 - 6 + \dfrac{1}{\nu_2}}; \tag{8}$$

and a few trials will convince one that for many of the distributions 
that are met in practice this solution yields a negative value for n, 
and that this occurs when $\nu_3 > \nu_2$. 
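The behaviour of solution (8) may be illustrated by a short computation. The names below are hypothetical; moments are computed directly from the terms of a series and then passed to (8). The second set of terms is built from combinations with repetitions, ${}_{m+k-1}C_k$, and is included only to exhibit a case in which (8) returns a negative n.

```python
from fractions import Fraction as F
from math import comb

def central_moments(weights):
    """Moments nu_2, nu_3, nu_4 about the mean for weights = {position: term}."""
    total = sum(weights.values())
    mean = F(sum(x * w for x, w in weights.items()), total)
    return tuple(
        sum((x - mean)**k * F(w, total) for x, w in weights.items())
        for k in (2, 3, 4)
    )

def n_from_moments(nu2, nu3, nu4):
    """Solution (8) for n in terms of beta_1, beta_2 and nu_2."""
    beta1 = nu3**2 / nu2**3
    beta2 = nu4 / nu2**2
    return 6 * (beta2 - beta1 - 1) / (2 * beta2 - 3 * beta1 - 6 + 1 / nu2)

black, white, r = 7, 5, 4      # hypothetical bag: pn = 7, qn = 5, so n = 12
without_repetition = {x: comb(black, r - x) * comb(white, x) for x in range(r + 1)}
with_repetition = {
    x: comb(black + r - x - 1, r - x) * comb(white + x - 1, x) for x in range(r + 1)
}
print(n_from_moments(*central_moments(without_repetition)))
print(n_from_moments(*central_moments(with_repetition)))
```

The first value recovers the n of the bag; the second is the same number with its sign reversed.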

If we now consider the hypergeometric series 

$$u_x = {}_{pn}H_{r-x} \cdot {}_{qn}H_x$$

where ${}_nH_r$ denotes the number of combinations of n things taken r 
at a time when repetitions are allowed, i.e., ${}_nH_r = {}_{n+r-1}C_r$, we have 

$$\frac{\Delta y_x}{y_x} = \frac{\{r(1-2p) + 1 - pn\} - (n-2)\,x}{\{r(1-p) + 1 + x\}\{p(n+r) - 1 - x\}}, \tag{9}$$



where the mean, which is also at rq, is taken as the origin. 
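Comparing equation (9) with (4) of Section III we obtain for the 
series $u_x = {}_{pn}H_{r-x} \cdot {}_{qn}H_x$ 

$$\begin{aligned}
\nu_2 &= rpq\,\frac{n+r}{n+1} \\
\nu_3 &= \nu_2\,(p-q)\,\frac{n+2r}{n+2} \\
\nu_4 &= \frac{\nu_2}{(n+2)(n+3)}\Bigl\{3(n+1)\Bigl(n-6+\frac{2}{pq}\Bigr)\nu_2 + n(n-1) - 6pq\,n^2\Bigr\}.
\end{aligned} \tag{10}$$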



It may now be noted that equations (9) and (10) may be obtained 
directly from (6) and (7) by replacing n in the latter by $(-n)$. 
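The substitution of $-n$ for n may be verified on a numerical case. The sketch below uses hypothetical function names and an arbitrary case with pn = 7, qn = 5, r = 4; it computes the moments of the series with repetitions term by term and compares them with (10) and with (7) evaluated at $-n$, the bracket in $\nu_4$ being read with the term $2/(pq)$.

```python
from fractions import Fraction as F
from math import comb

def repeated_combinations(m, k):
    """Combinations of m things taken k at a time, repetitions allowed."""
    return comb(m + k - 1, k)

def series_h_moments(pn, qn, r):
    """Moments nu_2, nu_3, nu_4 about the mean, computed from the terms themselves."""
    weights = {
        x: repeated_combinations(pn, r - x) * repeated_combinations(qn, x)
        for x in range(r + 1)
    }
    total = sum(weights.values())
    mean = F(sum(x * w for x, w in weights.items()), total)
    return tuple(
        sum((x - mean)**k * F(w, total) for x, w in weights.items())
        for k in (2, 3, 4)
    )

def formula_7(n, p, q, r):
    """Closed forms (7); n is allowed to be negative here."""
    n = F(n)
    nu2 = r * p * q * (n - r) / (n - 1)
    nu3 = nu2 * (p - q) * (n - 2 * r) / (n - 2)
    bracket = 3 * (n - 1) * (n + 6 - 2 / (p * q)) * nu2 + n * (n + 1) - 6 * p * q * n**2
    return nu2, nu3, nu2 / ((n - 2) * (n - 3)) * bracket

def formula_10(n, p, q, r):
    """Closed forms (10)."""
    n = F(n)
    nu2 = r * p * q * (n + r) / (n + 1)
    nu3 = nu2 * (p - q) * (n + 2 * r) / (n + 2)
    bracket = 3 * (n + 1) * (n - 6 + 2 / (p * q)) * nu2 + n * (n - 1) - 6 * p * q * n**2
    return nu2, nu3, nu2 / ((n + 2) * (n + 3)) * bracket

p, q = F(7, 12), F(5, 12)
print(series_h_moments(7, 5, 4))
print(formula_10(12, p, q, 4))
print(formula_7(-12, p, q, 4))
```

All three lines printed are identical.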

If for convenience we designate the point binomial or Bernoulli's 
series by Series B, the hypergeometric series $u_x = {}_{pn}C_{r-x} \cdot {}_{qn}C_x$ as Series 
C, and $u_x = {}_{pn}H_{r-x} \cdot {}_{qn}H_x$ as Series H, we see that Series H may be used 
for those distributions for which Series B and C are meaningless. 

An analysis of the means and dispersions of these three series is 
enlightening. From the following table we note that although their 
means are identical the dispersion of Series C is always less and that 
of Series H greater than the Bernoullian dispersion. Moreover, as n 
approaches infinity, the dispersions of Series C and H approach the 
Bernoullian as a limit from opposite sides. 

TABLE III 

$$\begin{array}{c|c|c}
\text{Series} & \text{Mean} & \text{Dispersion} \\ \hline
C & rq & rpq\,\dfrac{n-r}{n-1} \\[2ex]
B & rq & rpq \\[1ex]
H & rq & rpq\,\dfrac{n+r}{n+1}
\end{array}$$
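The ordering of the three dispersions, and their approach to a common limit, may be seen from a brief computation. The function name and the values r = 4, p = 7/12 are hypothetical and chosen only for illustration; r is taken greater than unity, since for r = 1 the three dispersions coincide.

```python
from fractions import Fraction as F

def dispersions(n, r, p):
    """Dispersions of Series C, B and H as given in Table III."""
    p = F(p)
    q = 1 - p
    bernoullian = r * p * q
    return {
        "C": bernoullian * F(n - r, n - 1),
        "B": bernoullian,
        "H": bernoullian * F(n + r, n + 1),
    }

for n in (12, 120, 1200000):
    d = dispersions(n, 4, F(7, 12))
    print(n, float(d["C"]), float(d["B"]), float(d["H"]))
```

As n increases the first and third columns close in upon the second from below and from above respectively.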

Inasmuch as the Bernoulli Series, because of its rather extensive 
degree of freedom, is itself a powerful "closed" graduation function, it 
follows that the combination of these three series, affording an additional 
continuous degree of freedom, is capable of graduating practically 
any unimodal distribution. 

These considerations throw an interesting light on the convergence 
of $f(x) = b_0 + b_1 x + b_2 x^2 + \dots$ in the denominator of either the difference 
or differential equation. If we stop with $b_1 x$ the freedom is 
restricted to that of a point binomial, and the addition of $b_2 x^2$ increases 
the freedom to at least that of the hypergeometric series. 

At the present time tables, based on formulae (7) and (10), are being 
prepared which will enable one to obtain by inspection the proper 
values of p, q, r, and n when the values of the moments are known. 
It is hoped that in this way a simple method of graduating frequency 
distributions may become available, and, what is more important, 
that something may be accomplished in the direction of classifying 
distributions. 



